Abstract

Learning representations for semantic relations is important for various tasks such as analogy detection, relational search, and relation classification. Although there have been several proposals for learning representations for individual words, learning word representations that explicitly capture the semantic relations between words remains underdeveloped. We propose an unsupervised method for learning vector representations for words such that the learnt representations are sensitive to the semantic relations that exist between two words. First, we extract lexical patterns from the co-occurrence contexts of two words in a corpus to represent the semantic relations that exist between those two words. Second, we represent a lexical pattern as the weighted sum of the representations of the words that co-occur with that lexical pattern. Third, we train a binary classifier to detect relationally similar vs. non-similar lexical pattern pairs. The proposed method is unsupervised in the sense that the lexical pattern pairs we use as training data are automatically sampled from a corpus, without requiring any manual intervention. Our proposed method statistically significantly outperforms the current state-of-the-art word representations on three benchmark datasets for proportional analogy detection, demonstrating its ability to accurately capture the semantic relations among words.


1 Introduction 

Representing the semantics of words and relations is a fundamental task in Knowledge Representation (KR). Numerous methods for learning distributed word representations have been proposed in the NLP community [Turian et al., 2010; Collobert et al., 2011; Mikolov et al., 2013b; Mikolov et al., 2013a; Pennington et al., 2014]. Distributed word representations have been shown to improve performance in a wide range of tasks such as machine translation [Cho et al., 2014], semantic similarity measurement [Mikolov et al., 2013d; Pennington et al., 2014], and word sense disambiguation [Huang et al., 2012].


Despite the impressive performance of representation learning methods for individual words, existing methods use only co-occurrences between words, ignoring the rich semantic relational structure. The context in which two words co-occur provides useful insights into the semantic relations that exist between those two words. For example, the sentence "ostrich is a large bird" not only provides the information that ostrich and bird co-occur, but also describes how they are related via the lexical pattern "X is a large Y", where the slot variables X and Y correspond to the two words between which the relation holds. If we can somehow embed the information about the semantic relations R that are associated with a particular word w into the representation of w, then we can construct richer semantic representations than the pure co-occurrence-based word representations. Although the word representations learnt by co-occurrence prediction methods [Mikolov et al., 2013d; Pennington et al., 2014] have implicitly captured a certain degree of relational structure, it remains unknown how to explicitly embed the information about semantic relations into word representations.

We propose a method for learning word representations that explicitly encode the information about the semantic relations that exist between words. Given a large corpus, we extract lexical patterns that correspond to numerous semantic relations that exist between word-pairs $(x_i, x_j)$. Next, we represent each word $x_i$ in the vocabulary by a $d$-dimensional vector $\mathbf{x}_i \in \mathbb{R}^d$. Word representations can be initialized either randomly or by using pre-trained word representations. Next, we represent a pattern $p$ by the weighted average of the vector differences $(\mathbf{x}_i - \mathbf{x}_j)$ corresponding to the word-pairs $(x_i, x_j)$ that co-occur with $p$ in the corpus. This enables us to represent a pattern $p$ by a $d$-dimensional vector $\mathbf{p} \in \mathbb{R}^d$ in the same embedding space as the words. Using the vector difference between word representations to represent semantic relations is motivated by observations in prior work on word representation learning [Mikolov et al., 2013d; Pennington et al., 2014] where, for example, the difference of the vectors representing king and queen has been shown to be similar to the difference of the vectors representing man and woman.

We model the problem of embedding semantic relations into word representations as an analogy prediction task where, given two lexical patterns, we train a binary classifier that predicts whether they are relationally similar. Our proposed method is unsupervised in the sense that both the positive and negative training instances that we use for training are automatically selected from a corpus, without requiring any manual intervention. Specifically, pairs of lexical patterns that co-occur with the same set of word-pairs are selected as positive training instances, whereas negative training instances are randomly sampled from pairs of patterns with low relational similarities. Our proposed method alternates between two steps (Algorithm 1). In the first step, we construct pattern representations from the current word representations. In the second step, we predict whether a given pair of patterns is relationally similar using the pattern representations computed in the previous step. We update the word representations such that the prediction loss is minimized.

Direct evaluation of word representations is difficult because there is no agreed gold standard for the semantic representation of words. Following prior work on representation learning, we evaluate the proposed method using the learnt word representations in an analogy detection task. For example, denoting the word representation for a word $w$ by $v(w)$, the vector $v(\text{king}) - v(\text{man}) + v(\text{woman})$ is required to be more similar to $v(\text{queen})$ than to the representation of any other word in the vocabulary. Similarity between two vectors is computed by the cosine of the angle between the corresponding vectors. The accuracy obtained in the analogy detection task with a particular word representation method is considered as a measure of its accuracy. In our evaluations, we use three previously proposed benchmark datasets for word analogy detection: the SAT analogy dataset [Turney, 2005], the Google analogy dataset [Mikolov et al., 2013c], and the SemEval analogy dataset [Jurgens et al., 2012]. The word representations produced by our proposed method statistically significantly outperform the current state-of-the-art word representation learning methods on all three benchmark datasets in an analogy detection task, demonstrating the accuracy of the proposed method for embedding semantic relations in word representations.


2 Related Work 

Representing words using vectors (or tensors in general) is an essential task in text processing. For example, in distributional semantics [Baroni and Lenci, 2010], a word x is represented by a vector that contains other words that co-occur with x in a corpus. Numerous methods for selecting co-occurrence contexts (e.g. proximity-based windows, dependency relations) and word association measures (e.g. pointwise mutual information (PMI), log-likelihood ratio (LLR), local mutual information (LMI)) have been proposed [Turney and Pantel, 2010]. Despite the successful applications of co-occurrence counting-based distributional word representations, their high dimensionality and sparsity are often problematic when applied in NLP tasks. Consequently, further post-processing such as dimensionality reduction and feature selection is often required when using distributional word representations.

On the other hand, distributed word representation learning methods model words as $d$-dimensional real vectors and learn those vector representations by applying them to solve an auxiliary task such as language modeling. The dimensionality $d$ is fixed for all the words in the vocabulary and, unlike in distributional word representations, is much smaller (e.g. $d \in [10, 1000]$ in practice) than the vocabulary size. A pioneering work on word representation learning is the neural network language model (NNLM) [Bengio et al., 2003], where word representations are learnt such that we can accurately predict the next word in a sentence using the word representations for the previous words. Using backpropagation, word vectors are updated such that the prediction error is minimized.

Although NNLMs learn word representations as a by-product, the main focus of language modeling is on predicting the next word in a sentence given the previous words, and not on learning word representations that capture word semantics. Moreover, training multi-layer neural networks with large text corpora is often time consuming. To overcome those limitations, methods that specifically focus on learning word representations that capture word semantics using large text corpora have been proposed. Instead of using only the previous words in a sentence as in language modeling, these methods use all the words in a contextual window for the prediction task [Collobert et al., 2011]. Methods that use one or no hidden layers have been proposed to improve the scalability of the learning algorithms. For example, the skip-gram model [Mikolov et al., 2013c] predicts the words c that appear in the local context of a word x, whereas the continuous bag-of-words model (CBOW) predicts a word x conditioned on all the words c that appear in x's local context [Mikolov et al., 2013a]. However, methods that use global co-occurrences in the entire corpus to learn word representations have been shown to outperform methods that use only local co-occurrences [Huang et al., 2012; Pennington et al., 2014]. Word representations learnt using the above-mentioned representation learning methods have shown superior performance over word representations constructed using the traditional counting-based methods [Baroni et al., 2014].

Word representations can be further classified depending on whether they are task-specific or task-independent. For example, methods for learning word representations for specific tasks such as sentiment classification [Socher et al., 2011] and semantic composition [Hashimoto et al., 2014] have been proposed. These methods use labeled data for the target task to train supervised models, and learn word representations that optimize the performance on this target task. Whether the meaning of a word is task-specific or task-independent remains an interesting open question. Our proposal can be seen as a third alternative in the sense that we use task-independent pre-trained word representations as the input, and embed the knowledge related to the semantic relations into the word representations. However, unlike the existing task-specific word representation learning methods, we do not require manually labeled data for the target task (i.e. analogy detection).


3 Learning Word Representations 

The local context in which two words co-occur provides useful information regarding the semantic relations that exist between those two words. For example, from the sentence "Ostrich is a large bird that primarily lives in Africa", we can infer that the semantic relation IS-A-LARGE exists between ostrich and bird. Prior work on relational similarity measurement has successfully used such lexical patterns as features to represent the semantic relations that exist between two words [Duc et al., 2010; Duc et al., 2011]. According to the relational duality hypothesis [Bollegala et al., 2010], a semantic relation R can be expressed either extensionally by enumerating word-pairs for which R holds, or intensionally by stating lexico-syntactic patterns that define the properties of R.

Following this prior work, we extract lexical patterns from the co-occurring contexts of two words to represent the semantic relations between those two words. Specifically, we extract unigrams and bigrams of tokens as patterns from the midfix (i.e. the sequence of tokens that appear in between the given two words in a context) [Bollegala et al., 2007b; Bollegala et al., 2007a]. Although we use lexical patterns as features for representing semantic relations in this work, our proposed method is not limited to lexical patterns, and can be used in principle with any type of features that represent relations. The strength of association between a word-pair $(u, v)$ and a pattern $p$ is measured using the positive pointwise mutual information (PPMI), $f(p, u, v)$, which is defined as follows.


$$f(p, u, v) = \max\left(0, \log \frac{g(p, u, v)\, g(*, *, *)}{g(p, *, *)\, g(*, u, v)}\right) \tag{1}$$


Here, $g(p, u, v)$ denotes the number of co-occurrences between $p$ and $(u, v)$, and $*$ denotes the summation taken over all words (or patterns) corresponding to the slot variable. We represent a pattern $p$ by the set $\mathcal{R}(p)$ of word-pairs $(u, v)$ for which $f(p, u, v) > 0$. Formally, we define $\mathcal{R}(p)$ and its norm $|\mathcal{R}(p)|$ as follows,

$$\mathcal{R}(p) = \{(u, v) \mid f(p, u, v) > 0\} \tag{2}$$

$$|\mathcal{R}(p)| = \sum_{(u, v) \in \mathcal{R}(p)} f(p, u, v) \tag{3}$$
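The PPMI weighting and the set $\mathcal{R}(p)$ can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the standard PPMI form $\max(0, \log[g(p,u,v)\,g(*,*,*) / (g(p,*,*)\,g(*,u,v))])$, and the function names and toy counts are hypothetical.

```python
import math
from collections import defaultdict

def ppmi_table(counts):
    """counts maps (pattern, u, v) -> co-occurrence count g(p, u, v)."""
    total = sum(counts.values())            # g(*, *, *)
    by_pattern = defaultdict(float)         # g(p, *, *)
    by_pair = defaultdict(float)            # g(*, u, v)
    for (p, u, v), g in counts.items():
        by_pattern[p] += g
        by_pair[(u, v)] += g
    f = {}
    for (p, u, v), g in counts.items():
        pmi = math.log(g * total / (by_pattern[p] * by_pair[(u, v)]))
        f[(p, u, v)] = max(0.0, pmi)        # clip negative PMI to zero
    return f

def pairs_of(f, p):
    """R(p): word-pairs with strictly positive PPMI for pattern p."""
    return {(u, v): w for (q, u, v), w in f.items() if q == p and w > 0}

def norm_of(f, p):
    """|R(p)|: sum of the PPMI values over R(p)."""
    return sum(pairs_of(f, p).values())

# Toy counts (hypothetical, for illustration only).
counts = {
    ("X is a large Y", "ostrich", "bird"): 8,
    ("X is a large Y", "lion", "cat"): 2,
    ("X such as Y", "lion", "cat"): 6,
    ("X such as Y", "ostrich", "bird"): 1,
}
f = ppmi_table(counts)
```

In this toy corpus the pair (lion, cat) occurs with "X is a large Y" less often than chance would predict, so its PPMI is clipped to zero and it is excluded from $\mathcal{R}(p)$ for that pattern.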


We represent a word $x$ using a vector $\mathbf{x} \in \mathbb{R}^d$. The dimensionality of the representation, $d$, is a hyperparameter of the proposed method. Prior work on word representation learning has observed that the difference between the vectors that represent two words closely approximates the semantic relations that exist between those two words. For example, the vector $v(\text{king}) - v(\text{queen})$ has been shown to be similar to the vector $v(\text{man}) - v(\text{woman})$. We use this property to represent a pattern $p$ by a vector $\mathbf{p} \in \mathbb{R}^d$ as the weighted sum of the differences between the two words in all word-pairs $(u, v)$ that co-occur with $p$ as follows,

$$\mathbf{p} = \frac{1}{|\mathcal{R}(p)|} \sum_{(u, v) \in \mathcal{R}(p)} f(p, u, v)(\mathbf{u} - \mathbf{v}). \tag{4}$$
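As a small worked example of Eq. 4, the sketch below computes a pattern vector as the PPMI-weighted average of word-vector differences. The helper name `pattern_vector`, the two-dimensional vectors, and the PPMI weights are hypothetical values chosen so the result can be checked by hand.

```python
def pattern_vector(weighted_pairs, word_vecs):
    """Eq. 4: PPMI-weighted average of the difference vectors (u - v).

    weighted_pairs maps (u, v) -> f(p, u, v) > 0 for a single pattern p.
    word_vecs maps a word to its vector (a list of floats).
    """
    norm = sum(weighted_pairs.values())                 # |R(p)|
    dim = len(next(iter(word_vecs.values())))
    p = [0.0] * dim
    for (u, v), weight in weighted_pairs.items():
        for k in range(dim):
            p[k] += weight * (word_vecs[u][k] - word_vecs[v][k]) / norm
    return p

word_vecs = {
    "ostrich": [1.0, 0.0], "bird": [0.0, 0.0],
    "lion": [0.0, 2.0], "cat": [0.0, 0.0],
}
# Weights 3 and 1 give |R(p)| = 4, so p = (3 * [1, 0] + 1 * [0, 2]) / 4.
p = pattern_vector({("ostrich", "bird"): 3.0, ("lion", "cat"): 1.0}, word_vecs)
```

Because the pattern vector is built only from word vectors, no separate parameters are learnt for patterns.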

For example, consider Fig. 1, where the two word-pairs (lion, cat) and (ostrich, bird) co-occur respectively with the two lexical patterns $p_1$ = "large Ys such as Xs" and $p_2$ = "X is a huge Y". Assuming that there are no other co-occurrences between word-pairs and patterns in the corpus, the representations of the patterns $p_1$ and $p_2$ are given respectively by $\mathbf{p}_1 = \mathbf{x}_1 - \mathbf{x}_2$ and $\mathbf{p}_2 = \mathbf{x}_3 - \mathbf{x}_4$. We measure the relational similarity between $(x_1, x_2)$ and $(x_3, x_4)$ using the inner product $\mathbf{p}_1^\top \mathbf{p}_2$.

We model the problem of learning word representations as a binary classification task, where we learn representations for words such that they can be used to accurately predict whether a given pair of patterns is relationally similar. In our previous example, we would learn representations for the four words lion, cat, ostrich, and bird such that the similarity between the two patterns "large Ys such as Xs" and "X is a huge Y" is maximized. Later, in Section 3.1, we propose an unsupervised method for selecting relationally similar (positive) and dissimilar (negative) pairs of patterns as training instances to train a binary classifier.

Let us denote the target label for two patterns $p_1$, $p_2$ by $t(p_1, p_2) \in \{1, 0\}$, where the value 1 indicates that $p_1$ and $p_2$ are relationally similar, and 0 otherwise. We compute the prediction loss for a pair of patterns $(p_1, p_2)$ as the squared loss between the target and the predicted labels as follows,

$$L(t(p_1, p_2), \mathbf{p}_1, \mathbf{p}_2) = \frac{1}{2}\left(t(p_1, p_2) - \sigma(\mathbf{p}_1^\top \mathbf{p}_2)\right)^2. \tag{5}$$


Different non-linear functions can be used as the prediction function $\sigma(\cdot)$, such as the logistic sigmoid, hyperbolic tangent, or rectified linear units. In our preliminary experiments we found the hyperbolic tangent, tanh, given by

$$\sigma(\theta) = \tanh(\theta) = \frac{\exp(\theta) - \exp(-\theta)}{\exp(\theta) + \exp(-\theta)} \tag{6}$$

to work particularly well among those different non-linearities.

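To derive the update rule for word representations, let us consider the derivative of the loss w.r.t. the word representation $\mathbf{x}$ of a word $x$,

$$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{p}_1}\frac{\partial \mathbf{p}_1}{\partial \mathbf{x}} + \frac{\partial L}{\partial \mathbf{p}_2}\frac{\partial \mathbf{p}_2}{\partial \mathbf{x}}, \tag{7}$$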


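where the partial derivatives of the loss w.r.t. the pattern representations are given by,

$$\frac{\partial L}{\partial \mathbf{p}_1} = \sigma'(\mathbf{p}_1^\top \mathbf{p}_2)\left(\sigma(\mathbf{p}_1^\top \mathbf{p}_2) - t(p_1, p_2)\right)\mathbf{p}_2, \tag{8}$$

$$\frac{\partial L}{\partial \mathbf{p}_2} = \sigma'(\mathbf{p}_1^\top \mathbf{p}_2)\left(\sigma(\mathbf{p}_1^\top \mathbf{p}_2) - t(p_1, p_2)\right)\mathbf{p}_1. \tag{9}$$

Here, $\sigma'$ denotes the first derivative of tanh, which is given by $1 - \sigma(\theta)^2$. To simplify the notation we drop the arguments of the loss function.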

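From Eq. 4 we get,

$$\frac{\partial \mathbf{p}_1}{\partial \mathbf{x}} = \frac{1}{|\mathcal{R}(p_1)|}\left(h(p_1, u = x, v) - h(p_1, u, v = x)\right), \tag{10}$$

$$\frac{\partial \mathbf{p}_2}{\partial \mathbf{x}} = \frac{1}{|\mathcal{R}(p_2)|}\left(h(p_2, u = x, v) - h(p_2, u, v = x)\right), \tag{11}$$

where,

$$h(p, u = x, v) = \sum_{(x, v) \in \{(u, v) \mid (u, v) \in \mathcal{R}(p),\, u = x\}} f(p, x, v),$$

and

$$h(p, u, v = x) = \sum_{(u, x) \in \{(u, v) \mid (u, v) \in \mathcal{R}(p),\, v = x\}} f(p, u, x).$$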


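Substituting the partial derivatives given by Eqs. 8-11 in Eq. 7 we get,

$$\frac{\partial L}{\partial \mathbf{x}} = \lambda(p_1, p_2)\left[H(p_1, x) \sum_{(u, v) \in \mathcal{R}(p_2)} f(p_2, u, v)(\mathbf{u} - \mathbf{v}) + H(p_2, x) \sum_{(u, v) \in \mathcal{R}(p_1)} f(p_1, u, v)(\mathbf{u} - \mathbf{v})\right], \tag{12}$$

where $\lambda(p_1, p_2)$ is defined as

$$\lambda(p_1, p_2) = \frac{\sigma'(\mathbf{p}_1^\top \mathbf{p}_2)\left(\sigma(\mathbf{p}_1^\top \mathbf{p}_2) - t(p_1, p_2)\right)}{|\mathcal{R}(p_1)|\,|\mathcal{R}(p_2)|}, \tag{13}$$

and $H(p, x)$ is defined as

$$H(p, x) = h(p, u = x, v) - h(p, u, v = x). \tag{14}$$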
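One way to sanity-check the closed-form gradient is to compare it against central finite differences of the loss. The sketch below does this on a toy example. It assumes the sign $(\sigma - t)$ that follows from differentiating the squared loss in Eq. 5, and all function names, word vectors, and PPMI weights are hypothetical.

```python
import math

def diff_sum(pairs, vecs, dim):
    """Sum of f(p,u,v) * (u - v) over R(p), i.e. |R(p)| times the pattern vector."""
    s = [0.0] * dim
    for (u, v), f in pairs.items():
        for k in range(dim):
            s[k] += f * (vecs[u][k] - vecs[v][k])
    return s

def theta(pairs1, pairs2, vecs, dim):
    """Inner product of the two pattern vectors."""
    n1, n2 = sum(pairs1.values()), sum(pairs2.values())
    s1, s2 = diff_sum(pairs1, vecs, dim), diff_sum(pairs2, vecs, dim)
    return sum(a * b for a, b in zip(s1, s2)) / (n1 * n2)

def loss(pairs1, pairs2, t, vecs, dim):
    """Squared loss with tanh as the prediction function."""
    return 0.5 * (t - math.tanh(theta(pairs1, pairs2, vecs, dim))) ** 2

def big_h(pairs, x):
    """H(p, x): PPMI mass with x in slot u minus PPMI mass with x in slot v."""
    return sum(f * ((u == x) - (v == x)) for (u, v), f in pairs.items())

def analytic_grad(pairs1, pairs2, t, vecs, dim, x):
    """Gradient of the loss w.r.t. the vector of word x in closed form."""
    n1, n2 = sum(pairs1.values()), sum(pairs2.values())
    s1, s2 = diff_sum(pairs1, vecs, dim), diff_sum(pairs2, vecs, dim)
    sigma = math.tanh(theta(pairs1, pairs2, vecs, dim))
    lam = (1.0 - sigma ** 2) * (sigma - t) / (n1 * n2)
    h1, h2 = big_h(pairs1, x), big_h(pairs2, x)
    return [lam * (h1 * s2[k] + h2 * s1[k]) for k in range(dim)]

def numeric_grad(pairs1, pairs2, t, vecs, dim, x, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = []
    for k in range(dim):
        original = vecs[x][k]
        vecs[x][k] = original + eps
        up = loss(pairs1, pairs2, t, vecs, dim)
        vecs[x][k] = original - eps
        down = loss(pairs1, pairs2, t, vecs, dim)
        vecs[x][k] = original
        grad.append((up - down) / (2 * eps))
    return grad

vecs = {
    "lion": [0.2, -0.1, 0.4], "cat": [0.1, 0.3, -0.2],
    "ostrich": [-0.3, 0.5, 0.1], "bird": [0.0, 0.2, 0.6],
}
pairs1 = {("lion", "cat"): 1.5, ("ostrich", "bird"): 0.5}
pairs2 = {("ostrich", "bird"): 2.0, ("lion", "cat"): 0.3}
analytic = analytic_grad(pairs1, pairs2, 1, vecs, 3, "lion")
numeric = numeric_grad(pairs1, pairs2, 1, vecs, 3, "lion")
```

The two gradients should agree to within the finite-difference error, which is a quick way to catch a flipped sign or a missing normalization term.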


We use stochastic gradient descent (SGD) with the learning rate adapted by AdaGrad [Duchi et al., 2011] to update the word representations. The pseudo code for the proposed method is shown in Algorithm 1. Given a set of $N$ relationally similar and dissimilar pattern-pairs, $\{(p_1^{(i)}, p_2^{(i)}, t(p_1^{(i)}, p_2^{(i)}))\}_{i=1}^{N}$, Algorithm 1 initializes each word $x_j$ in the vocabulary with a vector $\mathbf{x}_j \in \mathbb{R}^d$. The initialization can be conducted either using vectors randomly sampled from a zero-mean, unit-variance Gaussian distribution, or using pre-trained word representations. In our preliminary experiments, we found that the word vectors learnt by GloVe [Pennington et al., 2014] consistently perform better than random vectors when used as the initial word representations in the proposed method. Because word vectors trained using existing word representation learning methods already demonstrate a certain degree of relational structure with respect to proportional analogies, we believe that initializing using pre-trained word vectors assists the subsequent optimization process.


Algorithm 1 Learning word representations.

Input: Training pattern-pairs $\{(p_1^{(i)}, p_2^{(i)}, t(p_1^{(i)}, p_2^{(i)}))\}_{i=1}^{N}$, dimensionality $d$ of the word representations, and the maximum number of iterations $T$.

Output: Representation $\mathbf{x}_j \in \mathbb{R}^d$ of a word $x_j$ for $j = 1, \ldots, M$, where $M$ is the vocabulary size.

1: Initialize word vectors $\{\mathbf{x}_j\}_{j=1}^{M}$.
2: for $t = 1$ to $T$ do
3:   for $k = 1$ to $K$ do
4:     $\mathbf{p}_k = \frac{1}{|\mathcal{R}(p_k)|} \sum_{(u, v) \in \mathcal{R}(p_k)} f(p_k, u, v)(\mathbf{u} - \mathbf{v})$
5:   end for
6:   for $i = 1$ to $N$ do
7:     for $j = 1$ to $M$ do
8:       $\mathbf{x}_j = \mathbf{x}_j - \alpha_j^{(t)} \frac{\partial L}{\partial \mathbf{x}_j}$
9:     end for
10:   end for
11: end for
12: return $\{\mathbf{x}_j\}_{j=1}^{M}$.


During each iteration, Algorithm 1 alternates between two steps. First, in Lines 3-5, it computes pattern representations using Eq. 4 from the current word representations for all the patterns ($K$ in total) in the training dataset. Second, in Lines 6-10, for each training pattern-pair we compute the derivative of the loss according to Eq. 12 and update the word representations. These two steps are repeated for $T$ iterations, after which the final set of word representations is returned.
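The alternating procedure can be sketched as a short training loop. This is an illustrative sketch under stated assumptions, not the authors' code: pattern vectors are recomputed for each training instance instead of once per iteration, the AdaGrad step uses a per-coordinate accumulator with a hypothetical base rate `lr`, and the toy vectors and unit PPMI weights are made up.

```python
import math

def pattern_vector(pairs, vecs):
    """PPMI-weighted average of (u - v) over the word-pairs of one pattern."""
    dim = len(next(iter(vecs.values())))
    norm = sum(pairs.values())
    return [sum(f * (vecs[u][k] - vecs[v][k]) for (u, v), f in pairs.items()) / norm
            for k in range(dim)]

def pair_loss(pattern_pairs, p1, p2, t, vecs):
    """Squared loss between the target t and tanh of the pattern inner product."""
    a = pattern_vector(pattern_pairs[p1], vecs)
    b = pattern_vector(pattern_pairs[p2], vecs)
    return 0.5 * (t - math.tanh(sum(x * y for x, y in zip(a, b)))) ** 2

def train(pattern_pairs, instances, vecs, iterations=10, lr=0.1, eps=1e-8):
    """Update the word vectors in place and return them."""
    dim = len(next(iter(vecs.values())))
    hist = {w: [0.0] * dim for w in vecs}            # AdaGrad accumulators
    for _ in range(iterations):
        for p1, p2, t in instances:
            r1, r2 = pattern_pairs[p1], pattern_pairs[p2]
            v1, v2 = pattern_vector(r1, vecs), pattern_vector(r2, vecs)
            sigma = math.tanh(sum(a * b for a, b in zip(v1, v2)))
            scale = (1.0 - sigma ** 2) * (sigma - t)  # derivative of the loss w.r.t. the inner product
            n1, n2 = sum(r1.values()), sum(r2.values())
            # Only words that fill a slot of either pattern get a non-zero gradient.
            words = {w for pairs in (r1, r2) for pair in pairs for w in pair}
            for x in words:
                h1 = sum(f * ((u == x) - (v == x)) for (u, v), f in r1.items()) / n1
                h2 = sum(f * ((u == x) - (v == x)) for (u, v), f in r2.items()) / n2
                for k in range(dim):
                    g = scale * (h1 * v2[k] + h2 * v1[k])
                    hist[x][k] += g * g
                    vecs[x][k] -= lr * g / (math.sqrt(hist[x][k]) + eps)
    return vecs

pattern_pairs = {
    "large Ys such as Xs": {("lion", "cat"): 1.0},
    "X is a huge Y": {("ostrich", "bird"): 1.0},
}
instances = [("large Ys such as Xs", "X is a huge Y", 1)]   # one positive pair
vecs = {"lion": [0.5, 0.3], "cat": [0.1, 0.2],
        "ostrich": [0.3, 0.6], "bird": [0.2, 0.1]}
before = pair_loss(pattern_pairs, *instances[0], vecs)
train(pattern_pairs, instances, vecs)
after = pair_loss(pattern_pairs, *instances[0], vecs)
```

On this toy positive instance the loss after training is lower than before, because each step pushes the two difference vectors towards a larger inner product.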

The computational complexity of Algorithm 1 is $O(TKd + TNMd)$, where $d$ is the dimensionality of the word representations. Naively iterating over $N$ training instances and $M$ words in the vocabulary can be prohibitively expensive for large training datasets and vocabularies. However, in practice we can efficiently compute the updates using two tricks: delayed updates and indexing. Once we have computed the pattern representations for all $K$ patterns in the first iteration, we can postpone the update of the representation of a pattern until that pattern next appears in a training instance. This reduces the number of patterns that are updated in each iteration to a maximum of 2 instead of $K$ for the iterations $t > 1$. Because of the sparseness in co-occurrences, only a handful (ca. 100) of patterns co-occur with any given word-pair. Therefore, by pre-compiling an index from a pattern to the words with which that pattern co-occurs, we can limit the update of word representations in Line 8 to a much smaller number than $M$. Moreover, the vector subtraction can be parallelized across the dimensions. Although the loss function defined by Eq. 5 is non-convex w.r.t. the word representations, in practice Algorithm 1 converges after a few (fewer than 5) iterations. In practice, it requires less than an hour to train from a 2 billion word corpus where we have N = 100,000, T = 10, K = 10,000 and M = 210,914.
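The indexing trick can be illustrated with a pre-compiled map from each pattern to the words that fill its slots. The sketch below is a hypothetical illustration; the helper names and toy data are not from the paper. It relies on the fact that $H(p, x) = 0$ for any word $x$ that never appears in $\mathcal{R}(p)$, so such words receive a zero gradient for that training instance.

```python
from collections import defaultdict

def build_pattern_to_words(pattern_pairs):
    """Map each pattern to the set of words that fill its X or Y slot."""
    index = defaultdict(set)
    for pattern, pairs in pattern_pairs.items():
        for u, v in pairs:
            index[pattern].update((u, v))
    return dict(index)

def words_to_update(index, p1, p2):
    """Words whose vectors can change for the training instance (p1, p2)."""
    return index.get(p1, set()) | index.get(p2, set())

pattern_pairs = {
    "large Ys such as Xs": {("lion", "cat"): 1.2},
    "X is a huge Y": {("ostrich", "bird"): 0.8},
    "X is the capital of Y": {("paris", "france"): 2.1},
}
index = build_pattern_to_words(pattern_pairs)
touched = words_to_update(index, "large Ys such as Xs", "X is a huge Y")
```

Here only four word vectors are visited for the instance, regardless of how large the vocabulary is.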

Lexical patterns contain sequences of multiple words. Therefore, exact occurrences of lexical patterns are rare compared to those of individual words, even in large corpora. Directly learning representations for lexical patterns using their co-occurrence statistics leads to data sparseness issues, which become problematic when applying existing methods proposed for learning representations for single words to learn representations for lexical patterns that consist of multiple words. The proposal made in Eq. 4 to compute representations for patterns circumvents this data sparseness issue by indirectly modeling patterns through word representations.

3.1 Selecting Similar/Dissimilar Pattern-Pairs 

We use the ukWaC corpus¹ to extract relationally similar (positive) and dissimilar (negative) pairs of patterns $(p_i, p_j)$ to train the proposed method. The ukWaC is a 2 billion word corpus constructed from the Web by limiting the crawl to the .uk domain. We select word-pairs that co-occur in at least 50 sentences within a co-occurrence window of 5 tokens. Moreover, using a stop word list, we ignore word-pairs that purely consist of stop words. We obtain 210,914 word-pairs from this step. Next, we extract lexical patterns for those word-pairs by replacing the first and second word in a word-pair respectively by the slot variables X and Y in a co-occurrence window of length 5 tokens. We select the 10,000 most frequently occurring lexical patterns (i.e. K = 10,000) for further processing.

We represent a pattern $p$ by a vector where the elements correspond to the PPMI values $f(p, u, v)$ between $p$ and all the word-pairs $(u, v)$ that co-occur with $p$. Next, we compute the cosine similarity between all pairwise combinations of the 10,000 patterns, and rank the pattern-pairs in the descending order of their cosine similarities. We select the top-ranked 50,000 pattern-pairs as positive training instances. We select 50,000 pattern-pairs from the bottom of the list which have non-zero similarity scores as negative training instances. The reason for not selecting pattern-pairs with zero similarity scores is that such patterns do not share any word-pairs in common, and are not informative as training data for updating word representations. Thus, the total number of training instances we select is N = 50,000 + 50,000 = 100,000.
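The selection procedure can be sketched as follows. This is a minimal illustration with hypothetical names and toy PPMI features: it scores all pattern-pairs exhaustively, which is only practical for small inputs, and it does not guard against the positive and negative sets overlapping when very few pairs are available.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    num = sum(a[k] * b[k] for k in set(a) & set(b))
    den = (math.sqrt(sum(x * x for x in a.values()))
           * math.sqrt(sum(x * x for x in b.values())))
    return num / den if den else 0.0

def select_training_pairs(pattern_feats, n_pos, n_neg):
    """Return (positives, negatives) as lists of (pattern, pattern, label)."""
    scored = [(cosine(pattern_feats[p], pattern_feats[q]), p, q)
              for p, q in combinations(sorted(pattern_feats), 2)]
    scored = [s for s in scored if s[0] > 0]     # drop pairs sharing no word-pair
    scored.sort(reverse=True)                    # most similar first
    positives = [(p, q, 1) for _, p, q in scored[:n_pos]]
    negatives = [(p, q, 0) for _, p, q in scored[-n_neg:]] if n_neg else []
    return positives, negatives

# Each pattern is a sparse vector of PPMI values indexed by word-pair (toy values).
pattern_feats = {
    "p_a": {("lion", "cat"): 2.0, ("ostrich", "bird"): 1.0},
    "p_b": {("lion", "cat"): 2.0, ("ostrich", "bird"): 1.5},
    "p_c": {("ostrich", "bird"): 0.1, ("paris", "france"): 3.0},
    "p_d": {("tokyo", "japan"): 1.0},
}
positives, negatives = select_training_pairs(pattern_feats, n_pos=1, n_neg=1)
```

The pattern `p_d` shares no word-pair with any other pattern, so every pair involving it has zero similarity and is never selected as a negative instance.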


Table 1: Word analogy results on benchmark datasets.

Method              sem.    synt.   total   SAT     SemEval
ivLBL CosAdd        63.60   61.80   62.60   20.85   34.63
ivLBL CosMult       65.20   63.00   64.00   19.78   33.42
ivLBL PairDiff      52.60   48.50   50.30   22.45   36.94
skip-gram CosAdd    31.89   67.67   51.43   29.67   40.89
skip-gram CosMult   33.98   69.62   53.45   28.87   38.54
skip-gram PairDiff   7.20   19.73   14.05   35.29   43.99
CBOW CosAdd         39.75   70.11   56.33   29.41   40.31
CBOW CosMult        38.97   70.39   56.13   28.34   38.19
CBOW PairDiff        5.76   13.43    9.95   33.16   42.89
GloVe CosAdd        86.67   82.81   84.56   27.00   40.11
GloVe CosMult       86.84   84.80   85.72   25.66   37.56
GloVe PairDiff      45.93   41.23   43.36   44.65   44.67
Prop CosAdd         86.70   85.35   85.97   29.41   41.86
Prop CosMult        86.91   87.04   86.98   28.87   39.67
Prop PairDiff       41.85   42.86   42.40   45.99   44.88


¹ http://wacky.sslmit.unibo.it

4 Evaluating Word Representations using Proportional Analogies

To evaluate the ability of the proposed method to learn word representations that embed information related to semantic relations, we apply it to detect proportional analogies. For example, consider the proportional analogy man:woman :: king:queen. Given the first three words, a word representation learning method is required to find the fourth word from the vocabulary that maximizes the relational similarity between the two word-pairs in the analogy. Three benchmark datasets have been popularly used in prior work for evaluating analogies: the Google dataset [Mikolov et al., 2013c] (10,675 syntactic analogies and 8,869 semantic analogies), the SemEval dataset [Jurgens et al., 2012] (79 questions), and the SAT dataset [Turney, 2006] (374 questions). For the Google dataset, the set of candidates for the fourth word consists of all the words in the vocabulary. For the SemEval and SAT datasets, each question word-pair is assigned a limited number of candidate word-pairs, out of which only one is correct. The accuracy of a word representation is evaluated by the percentage of the correctly answered analogy questions out of all the questions in a dataset. We do not skip any questions in our evaluations.

Given a proportional analogy a : b :: c : d, we use the following measures proposed in prior work for measuring the relational similarity between (a, b) and (c, d).

CosAdd: proposed by Mikolov et al. [2013d], ranks candidates d according to the formula

$$\mathrm{CosAdd}(a : b, c : d) = \cos(\mathbf{b} - \mathbf{a} + \mathbf{c}, \mathbf{d}), \tag{15}$$

and selects the top-ranked candidate as the correct answer.

CosMult: The CosAdd measure can be decomposed into the summation of three cosine similarities, where in practice one of the three terms often dominates the sum. To overcome this bias in CosAdd, Levy and Goldberg [2014] proposed the CosMult measure given by,

$$\mathrm{CosMult}(a : b, c : d) = \frac{\cos(\mathbf{b}, \mathbf{d}) \cos(\mathbf{c}, \mathbf{d})}{\cos(\mathbf{a}, \mathbf{d}) + \epsilon}. \tag{16}$$

We convert all cosine values $x \in [-1, 1]$ to non-negative values using the transformation $(x + 1)/2$. Here, $\epsilon$ is a small constant value to prevent the denominator becoming zero, and is set to $10^{-5}$ in the experiments.

PairDiff: measures the cosine similarity between the two vectors that correspond to the difference of the word representations of the two words in each word-pair. It follows from our hypothesis that the semantic relation between two words can be represented by the vector difference of their word representations. PairDiff has been used by Mikolov et al. [2013d] for detecting semantic analogies and is given by,

$$\mathrm{PairDiff}(a : b, c : d) = \cos(\mathbf{b} - \mathbf{a}, \mathbf{d} - \mathbf{c}). \tag{17}$$
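The three measures can be compared side by side on a toy example. The sketch below is illustrative only: the two-dimensional vectors are hypothetical, the helper names are not from the paper, and CosMult applies the $(x + 1)/2$ shift to each cosine before combining them.

```python
import math

def cos(a, b):
    """Cosine of the angle between two dense vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def shifted_cos(a, b):
    """Cosine mapped from [-1, 1] to [0, 1]."""
    return (cos(a, b) + 1.0) / 2.0

def cos_add(a, b, c, d):
    return cos([bi - ai + ci for ai, bi, ci in zip(a, b, c)], d)

def cos_mult(a, b, c, d, eps=1e-5):
    return shifted_cos(b, d) * shifted_cos(c, d) / (shifted_cos(a, d) + eps)

def pair_diff(a, b, c, d):
    return cos([bi - ai for ai, bi in zip(a, b)],
               [di - ci for ci, di in zip(c, d)])

def answer(a, b, c, candidates, vecs, measure):
    """Pick the candidate d that maximizes the measure for a : b :: c : d."""
    return max(candidates,
               key=lambda w: measure(vecs[a], vecs[b], vecs[c], vecs[w]))

# Toy vectors: the first axis loosely encodes royalty, the second gender.
vecs = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.0], "queen": [3.0, 1.0],
    "prince": [3.0, -0.5], "apple": [0.0, -1.0],
}
candidates = ["queen", "prince", "apple"]
picks = {m.__name__: answer("man", "woman", "king", candidates, vecs, m)
         for m in (cos_add, cos_mult, pair_diff)}
```

With a closed candidate set such as this one, all three measures select the same answer; the differences between them matter more when every word in the vocabulary is a candidate.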

5 Experiments and Results 

In Table 1 we compare the proposed method against previously proposed word representation learning methods: ivLBL [Mnih and Kavukcuoglu, 2013], skip-gram [Mikolov et al., 2013c], CBOW [Mikolov et al., 2013a], and GloVe [Pennington et al., 2014]. All methods compared in Table 1 are trained on the same ukWaC corpus of 2B tokens to produce 300 dimensional word vectors. We use the publicly available implementations²,³ by the original authors for training the word representations using the recommended parameter values. Therefore, any differences in performance reported in Table 1 can be directly attributed to the differences in the respective word representation learning methods. In all of our experiments, the proposed method converged in fewer than 5 iterations.

Figure 2: Accuracy on Google dataset when the proposed method is trained using 10k and 100k instances.

Figure 3: Accuracy of the proposed method on benchmark datasets for dimensionalities of the word representations.

From Table 1 we see that the proposed method (denoted by Prop) achieves the best results for the semantic (sem), syntactic (synt) and their union (total) analogy questions in the Google dataset using the CosMult measure. For analogy questions in the SAT and SemEval datasets the best performance is reported by the proposed method using the PairDiff measure. The PairDiff measure computes the cosine similarity between the two difference vectors $\mathbf{b} - \mathbf{a}$ and $\mathbf{d} - \mathbf{c}$, ignoring the spatial distances between the individual words, as opposed to CosAdd or CosMult. Recall that in the Google dataset we are required to find analogies from a large open vocabulary, whereas in the SAT and SemEval datasets the set of candidates is limited to a closed pre-defined set. Relying on direction alone, while ignoring spatial distance, is problematic when considering the entire vocabulary as candidates because we are likely to find candidates d that have the same relation to c as reflected by $\mathbf{b} - \mathbf{a}$. For example, given the analogy man:woman :: king:?, we are likely to recover feminine entities, but not necessarily royal ones, using PairDiff. On the other hand, in both the SemEval and SAT datasets, the set of candidate answers already contains the related candidates, leaving mainly the direction to be decided. For the remainder of the experiments described in the paper, we use CosMult for evaluations on the Google dataset, whereas PairDiff is used for the SAT and SemEval datasets. Results reported in Table 1 reveal that, according to the binomial exact test with 95% confidence, the proposed method statistically significantly outperforms GloVe, the current state-of-the-art word representation learning method, on all three benchmark datasets.

² https://code.google.com/p/word2vec/
³ http://nlp.stanford.edu/projects/glove/

To study the effect of the training dataset size on the performance of the proposed method, following the procedure described in Section 3.1, we sample two balanced datasets containing respectively 10,000 and 100,000 instances. Figure 2 shows the performance reported by the proposed method on the Google dataset. We see that the overall performance increases with the dataset size, and the gain is larger for syntactic analogies. This result can be explained considering that semantic relations are rarer than syntactic relations in the ukWaC corpus, a generic web crawl, used in our experiments. However, the proposed training data selection method provides us with a potentially unlimited source of positive and negative training instances which we can use to further improve the performance.

To study the effect of the dimensionality $d$ of the representation on the performance of the proposed method, we hold the training data size fixed and produce word representations for different dimensionalities. As shown in Figure 3, the performance increases until around 600 dimensions on the Google and SAT datasets, after which it stabilizes. The performance on the SemEval dataset remains relatively unaffected by the dimensionality of the representation.

6 Conclusions 

We proposed a method to learn word representations that embed information related to the semantic relations between words. A two-step algorithm that alternates between pattern and word representations was proposed. The proposed method significantly outperforms the current state-of-the-art word representation learning methods on three datasets containing proportional analogies.

Semantic relations that can be encoded as attributes in words are only a fraction of all types of semantic relations. Whether we can accurately embed semantic relations that involve multiple entities, or semantic relations that are only extrinsically and implicitly represented, remains unknown. We plan to explore these possibilities in our future work.

References 

[Baroni and Lenci, 2010] Marco Baroni and Alessandro 
Lenci. Distributional memory: A general framework 
for corpus-based semantics. Computational Linguistics, 
36(4):673-721, 2010. 

[Baroni et al., 2014] Marco Baroni, Georgiana Dinu, and
German Kruszewski. Don’t count, predict! A systematic
comparison of context-counting vs. context-predicting
semantic vectors. In ACL’14, pages 238-247, 2014.

[Bengio et al., 2003] Yoshua Bengio, Rejean Ducharme,
Pascal Vincent, and Christian Jauvin. A neural probabilistic
language model. Journal of Machine Learning Research,
3:1137-1155, 2003.

[Bollegala et al., 2007a] D. Bollegala, Y. Matsuo, and 
M. Ishizuka. An integrated approach to measuring 
semantic similarity between words using information 
available on the web. In Proceedings of NAACL HLT, 
pages 340-347, 2007. 

[Bollegala et al., 2007b] Danushka Bollegala, Yutaka Matsuo,
and Mitsuru Ishizuka. Websim: A web-based semantic
similarity measure. In Proc. of 21st Annual Conference
of the Japanese Society of Artificial Intelligence, pages
757-766, 2007.

[Bollegala et al., 2010] Danushka Bollegala, Yutaka Matsuo,
and Mitsuru Ishizuka. Relational duality: Unsupervised
extraction of semantic relations between entities on the
web. In WWW 2010, pages 151-160, 2010.

[Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer,
Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, and Yoshua Bengio. Learning phrase representations
using RNN encoder-decoder for statistical machine
translation. In EMNLP’14, pages 1724-1734, 2014.

[Collobert et al., 2011] Ronan Collobert, Jason Weston,
Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and
Pavel Kuksa. Natural language processing (almost) from
scratch. Journal of Machine Learning Research,
12:2493-2537, 2011.

[Duc et al., 2010] Nguyen Tuan Duc, Danushka Bollegala,
and Mitsuru Ishizuka. Using relational similarity between
word pairs for latent relational search on the web. In
IEEE/WIC/ACM International Conference on Web Intelligence
and Intelligent Agent Technology, pages 196-199,
2010.

[Duc et al., 2011] Nguyen Tuan Duc, Danushka Bollegala,
and Mitsuru Ishizuka. Cross-language latent relational
search: Mapping knowledge across languages. In Proc.
of the Twenty-Fifth AAAI Conference on Artificial Intelligence,
pages 1237-1242, 2011.

[Duchi et al., 2011] John Duchi, Elad Hazan, and Yoram 
Singer. Adaptive subgradient methods for online learning 
and stochastic optimization. Journal of Machine Learning 
Research, 12:2121-2159, July 2011. 


[Hashimoto et al., 2014] Kazuma Hashimoto, Pontus Stenetorp,
Makoto Miwa, and Yoshimasa Tsuruoka. Jointly
learning word representations and composition functions
using predicate-argument structures. In EMNLP’14, pages
1544-1555, 2014.

[Huang et al., 2012] Eric H. Huang, Richard Socher,
Christopher D. Manning, and Andrew Y. Ng. Improving
word representations via global context and multiple word
prototypes. In ACL’12, pages 873-882, 2012.

[Jurgens et al., 2012] David A. Jurgens, Saif Mohammad,
Peter D. Turney, and Keith J. Holyoak. Measuring degrees
of relational similarity. In SemEval’12, 2012.

[Levy and Goldberg, 2014] Omer Levy and Yoav Goldberg.
Linguistic regularities in sparse and explicit word
representations. In CoNLL, 2014.

[Mikolov et al., 2013a] Tomas Mikolov, Kai Chen, and Jeffrey
Dean. Efficient estimation of word representations in
vector space. CoRR, 2013.

[Mikolov et al., 2013b] Tomas Mikolov, Quoc V. Le, and 
Ilya Sutskever. Exploiting similarities among languages 
for machine translation. arXiv, 2013. 

[Mikolov et al., 2013c] Tomas Mikolov, Ilya Sutskever, Kai
Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed
representations of words and phrases and their compositionality.
In NIPS, pages 3111-3119, 2013.

[Mikolov et al., 2013d] Tomas Mikolov, Wen-tau Yih, and
Geoffrey Zweig. Linguistic regularities in continuous space
word representations. In NAACL’13, pages 746-751,
2013.

[Mnih and Kavukcuoglu, 2013] Andriy Mnih and Koray 
Kavukcuoglu. Learning word embeddings efficiently with 
noise-contrastive estimation. In NIPS, 2013. 

[Pennington et al., 2014] Jeffrey Pennington, Richard
Socher, and Christopher D. Manning. GloVe: Global
vectors for word representation. In EMNLP, 2014.

[Socher et al., 2011] Richard Socher, Jeffrey Pennington,
Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning.
Semi-supervised recursive autoencoders for predicting
sentiment distributions. In EMNLP’11, pages 151-161,
2011.

[Turian et al., 2010] Joseph Turian, Lev Ratinov, and Yoshua
Bengio. Word representations: A simple and general
method for semi-supervised learning. In ACL, pages
384-394, 2010.

[Turney and Pantel, 2010] Peter D. Turney and Patrick Pantel.
From frequency to meaning: Vector space models of
semantics. Journal of Artificial Intelligence Research,
37:141-188, 2010.

[Turney, 2005] P.D. Turney. Measuring semantic similarity
by latent relational analysis. In Proc. of IJCAI’05, pages
1136-1141, 2005.

[Turney, 2006] P.D. Turney. Similarity of semantic relations.
Computational Linguistics, 32(3):379-416, 2006.



